Chris Pollett > Old Classes > CS267
( Print View )

Student Corner:
  [Grades Sec2]

  [Submit Sec2]

  [Class Sign Up Sec2]

  [
Lecture Notes]
  [Discussion Board]

Course Info:
  [Texts & Links]
  [Topics/Outcomes]
  [Outcomes Matrix]
  [Grading]
  [HW/Quiz Info]
  [Exam Info]
  [Regrades]
  [Honesty]
  [Additional Policies]
  [Announcements]

HW Assignments:
  [Hw1]  [Hw2]  [Hw3]
  [Hw4]  [Quizzes]

Practice Exams:
  [Mid 1]  [Mid 2]  [Final]

                           












HW#2 --- last modified February 07 2019 23:03:09..

Solution set.

Due date: Oct 10

Files to be submitted:
  Hw2.zip

Purpose: To get experience with our first ranking algorithms using the vector space model as well as the boolean retrieval algorith,.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

LO1 -- Code a basic inverted index capable of performing conjunctive queries.

LO2 -- Be able to calculate by hand on small examples precision (fraction relevant results returned), recall (fraction of results which are relevant), and other IR statistics.

LO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework consists of three parts: A set of book problems, which are to be done individually, should be handwritten (no typing), and must be turned in at the start of class on the day indicated; a coding assignment; and an evaluation task. The latter two parts of this assignment can be done in a team of at most three people. For this portion only one team member needs to submit -- just be sure to mention the names of your team mates in the documentation for your code.

The book problems I want you to do are: Exercise 2.8 and 2.9 (due Sep. 19); Exercise 3.1 and 4.1 (due Oct. 3).

For the coding part of your homework I want you to expand upon the code you wrote for HW1. You will need to modify your code to make it a positional index. Now your program should be run from the command-line with a command like:

program_name documents_file.txt query query_algorithm 

For C/C++, program_name can be just the name of your program. For Java, PHP, Python the program name can be "java program_name" or "php program_name" or "python program_name". Again, you must program in either C/C++, Java, PHP, or Python and your program should work with a vanilla install of an at most one year old version of gcc/g++/Oracle's Java 1.6/Python 2.7/Zend's php. The documents_file.txt is, as in the last assignment, a text file containing documents. Documents are separated by two newline characters. The 0th document is up to the first pair of newlines, this is followed by the 1st document, etc. You should have a readme.txt file saying how to compile and run your program, who is in your group, and their student ids. For this homework, your program does not output the inverted index and related statistics to it. Instead, I want you to store these informations into a class which has the four public methods first, last, prev, and next, described in class for the inverted index ADT. Your implementation of prev and next should use galloping search. The command-line argument query above can be either a single word for example "bob", it can be a string of words for example "bob smith", or it can be a prefix boolean expression built out of * (for AND), + (for OR), and - (for not). If it is more that one term, when I run your program I will put the sequence of terms in double-quotes, so that it will come in to your program as a single command-line argument. As an example of a prefix expression "+ - bob * smith jane" corresponds to the query ((NOT bob) OR (smith AND JANE)). Finally, query_algorithm is either the word "cosine" or the word "boolean". As an example, I might run your program with a line like:

php search_program.php my_document.txt "apple records" boolean

If the query_algorithm is 'cosine' then I promise to only give inputs for the query which are either a single word for example "bob", it can be a string of words for example "bob smith"; for the 'boolean' case the inputs could be any of the three types. Queries without '+', "*', '-' should be treated as conjunctive queries. I promise not give any boolean queries where a negation is applied to something other than a term.

On an input as above, your program should return a sequence of lines, each line containing a pair (doc_id, rank). Here doc_id should be a document which contains the word apple and contains the word records (not necessarily adjacent to each other). Here rank should be the cosine rank score for that document with respect to this query, if the query_algorithm was cosine, and otherwise, `1/(doc_id+1)` in the boolean case. You should output results in descending order of rank. In computing cosine rank, use TF_IDF values for the vector component where the TF function is the example one given in the book.

For the last part of your assignment, I want you to create a small text collection of 20 documents and put them in a file my_test_file.txt as part of what you submit. Make up a list of six queries (two single word, two string of words, two boolean query involving '+', '*', and '-'). For each of your queries and each of the documents in my_test_file.txt make a 1 (yes), 0 (no) judgment about the relevance of that document to that query. Each of your queries should have the key words searched on as well as a description of the words you expect to get back. Number your queries and put your queries and their descriptions in queries.txt. For instance, you might have a query as your query number 1, the query "apple records", with a description of the query being the record company of the Beatles. As part of your description, you might say a document about Apple computers which mentions the word records should not be considered a relevant result for this query. In a separate text file, judgments.txt, you should lists the judgments you came up with above. This file should consist of lines like (query_no, doc_no, relevance) which record this information. Next run your program in cosine mode. View a score of 0 for a document as that document not being returned. I want you to compute by hand the recall, precision, and MAP scores of your search program with respect to each (non-boolean) queries using your judgments. Put your calculations into the file scores.txt

All components of the grade breakdown below will be graded as either 0- did not work/do, 1/2 - partially works/partially achieved, 1 - full credit

Point Breakdown

Book Problems (each problem 1pt - 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong) 2pts
Program compiles as described above and runs from the command-line using syntax above. 1pt
Program has a working implementation of inverted index ADT using galloping search for next and prev. 1pt
Program outputs results in the format described. 1pt
Program parses boolean query correctly 1pt
Program only documents which have all of the query terms in the cosine case, and correct documents in boolean case 1pt
Program computes cosine rank scores correctly and computes 1/doc_id+1 in boolean case. 1pt
Hw2.zip contains a file my_test_file.txt with 20 documents as described above. Hw2.zip contained the files queries.txt and judgments.txt and they are as described above. 1pt
Hw2.zip contains a file scores.txt and it contains recall, MAP, precision scores calculated correctly for the program results with respect to those queries and cosine ranking. 1pt
Total10pts